class: left, title-slide # Modelling
SBWL H & M: Data-based Storytelling
--- # Getting into modelling! .center[  ] [source](https://commons.wikimedia.org/wiki/File:ModelsCatwalk.jpg#/media/Archivo:ModelsCatwalk.jpg) --- # Required skills ___ - Data-science teams have different skill requirements: .center[  [Data Science and the Art of Persuasion](https://hbr.org/2019/01/data-science-and-the-art-of-persuasion) ] --- # What are we doing? ___ .center[ <iframe src="https://giphy.com/embed/Z45gyFc4vcgzgS9wkw" width="380" height="380" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/parksandrec-season-4-parks-and-recreation-rec-Z45gyFc4vcgzgS9wkw">via GIPHY</a></p>] --- # What are we doing? ___ - Four types of data stories: - Descriptive - *What has happened?* - Predictive - *What will happen?* - Diagnostic - *Why does it happen?* - Prescriptive - *What actions should be taken?* --- # What are we doing? ___ - Four types of data stories: - Descriptive - *What has happened?* - Predictive - *What will happen?* - **Causal Inference:** Diagnostic - *Why does it happen?* - **Causal Inference:** Prescriptive - *What actions should be taken?* --- # Descriptive ___ <!-- --> --- # Predictive ___ <!-- --> --- # Causal? ___  --- # Causal Inference vs. prediction ___ .dense[ - Variables can be *predictive* without a causal relationship - *Correlation does not imply causation* - Arcade revenue predicts CS doctorates (and vice versa) ] -- .dense[ - Variables can have a causal relationship without being *predictive* - *No correlation does not imply no causation* - Fuel used and speed on cruise control (uphill vs. flat) - What about the correlation of speed and slope? ] -- .dense[ - Variables can be predictive while not being *predictive* > - [*Correlation does not even imply correlation*](https://statmodeling.stat.columbia.edu/2014/08/04/correlation-even-imply-correlation/) > <footer>- Andrew Gelman</footer> ] --- # Causal Inference vs. 
prediction ___ .dense[ - Variables can be *predictive* without a causal relationship - *Correlation does not imply causation* - Arcade revenue predicts CS doctorates (and vice versa) ] .dense[ - Variables can have a causal relationship without being *predictive* - *No correlation does not imply no causation* - Fuel used and speed on cruise control (uphill vs. flat) - What about the correlation of speed and slope? ] .dense[ - Variables can be predictive (in sample) while not being *predictive* (in population) - [*Correlation does not even imply correlation*](https://statmodeling.stat.columbia.edu/2014/08/04/correlation-even-imply-correlation/) - There might be a correlation in the data but not in the population ] --- class: logo-small # Example: Causal but no correlation ___ <!-- --> --- # Always visualize! ___ .pull-left[ <table style="width:110%;"> <thead> <tr> <th style="text-align:left;"> data </th> <th style="text-align:right;"> mean x </th> <th style="text-align:right;"> mean y </th> <th style="text-align:right;"> sd x </th> <th style="text-align:right;"> sd y </th> <th style="text-align:right;"> corr x,y </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;width: 5cm; "> away </td> <td style="text-align:right;"> 54.27 </td> <td style="text-align:right;"> 47.83 </td> <td style="text-align:right;"> 16.77 </td> <td style="text-align:right;"> 26.94 </td> <td style="text-align:right;"> -0.06 </td> </tr> <tr> <td style="text-align:left;width: 5cm; "> bullseye </td> <td style="text-align:right;"> 54.27 </td> <td style="text-align:right;"> 47.83 </td> <td style="text-align:right;"> 16.77 </td> <td style="text-align:right;"> 26.94 </td> <td style="text-align:right;"> -0.07 </td> </tr> <tr> <td style="text-align:left;width: 5cm; "> circle </td> <td style="text-align:right;"> 54.27 </td> <td style="text-align:right;"> 47.84 </td> <td style="text-align:right;"> 16.76 </td> <td style="text-align:right;"> 26.93 </td> <td style="text-align:right;"> -0.07 </td> </tr> 
<tr> <td style="text-align:left;width: 5cm; "> dino </td> <td style="text-align:right;"> 54.26 </td> <td style="text-align:right;"> 47.83 </td> <td style="text-align:right;"> 16.77 </td> <td style="text-align:right;"> 26.94 </td> <td style="text-align:right;"> -0.06 </td> </tr> <tr> <td style="text-align:left;width: 5cm; "> high lines </td> <td style="text-align:right;"> 54.27 </td> <td style="text-align:right;"> 47.84 </td> <td style="text-align:right;"> 16.77 </td> <td style="text-align:right;"> 26.94 </td> <td style="text-align:right;"> -0.07 </td> </tr> <tr> <td style="text-align:left;width: 5cm; "> star </td> <td style="text-align:right;"> 54.27 </td> <td style="text-align:right;"> 47.84 </td> <td style="text-align:right;"> 16.77 </td> <td style="text-align:right;"> 26.93 </td> <td style="text-align:right;"> -0.06 </td> </tr> <tr> <td style="text-align:left;width: 5cm; "> wide lines </td> <td style="text-align:right;"> 54.27 </td> <td style="text-align:right;"> 47.83 </td> <td style="text-align:right;"> 16.77 </td> <td style="text-align:right;"> 26.94 </td> <td style="text-align:right;"> -0.07 </td> </tr> <tr> <td style="text-align:left;width: 5cm; "> x shape </td> <td style="text-align:right;"> 54.26 </td> <td style="text-align:right;"> 47.84 </td> <td style="text-align:right;"> 16.77 </td> <td style="text-align:right;"> 26.93 </td> <td style="text-align:right;"> -0.07 </td> </tr> </tbody> </table> ] -- .pull-right[ <!-- --> ] --- # Example: Family size ___ - Data: Pairs of moms and daughters - Family size - Birth order - Question: Causal effect of mom's family size on daughter's? 
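
A minimal sketch of how such a question could be explored by simulation, written in Python for illustration only. Everything in it is an assumption and not taken from the slides: an unobserved family background `u` raises both family sizes, birth order is coded 0/1, all coefficients are made up, and family size lives on a stylized continuous scale. The true effect of the mom's family size on the daughter's is set to zero.

```python
import random


def slope(x, y):
    """OLS slope from regressing y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """Part of y that is not linearly explained by x."""
    b = slope(x, y)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]


random.seed(1)
n = 5000
true_effect = 0.0  # ASSUMPTION: mom's family size has no effect on daughter's

# ASSUMPTION: u is an unobserved common cause (e.g. family background)
u = [random.gauss(0, 1) for _ in range(n)]
# ASSUMPTION: birth order coded 0/1, independent for mom and daughter
b_mom = [random.randint(0, 1) for _ in range(n)]
b_dau = [random.randint(0, 1) for _ in range(n)]

# Family sizes on a stylized continuous scale (hypothetical coefficients)
mom = [2 * b + ui + random.gauss(0, 1) for b, ui in zip(b_mom, u)]
dau = [
    2 * b + ui + true_effect * mi + random.gauss(0, 1)
    for b, ui, mi in zip(b_dau, u, mom)
]

# Naive regression: daughter's family size on mom's family size
naive = slope(mom, dau)

# "Oracle" regression that also controls for u (impossible with real data):
# partial out u from both variables, then regress the residuals
adjusted = slope(residuals(u, mom), residuals(u, dau))

print(f"true effect:        {true_effect:.2f}")
print(f"naive estimate:     {naive:.2f}")
print(f"adjusted for u:     {adjusted:.2f}")
```

Under these assumptions the naive slope comes out clearly positive although the true effect is zero. Only the adjustment for `u`, which real data would not contain, brings the estimate back towards zero.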
--- # Simulation: No effect ___ <!-- --> --- # Example: Sales and Marketing spending ___ - Data: Sales and marketing spending - Question: Causal effect of marketing spending on sales - Additional knowledge: Marketing budget is based on economic outlook - Customers base spending on economic outlook --- # Simulation: Confounds ___ <!-- --> --- # Take-Aways ___ - Theoretical knowledge about the subject at hand (data being modelled) is crucial - Before a statistical model we need a causal model - Causal model: Selection of variables (analysis of missing data problems) - Statistical model: Functional form, method, ... `\(\Rightarrow\)` suitable for data - *Sometimes* it's good to add more (control) variables, *sometimes* it is not - Predictive power does not help us in deciding which variables to add - p-values are liars! --- # Causal Inference ___ .pull-left[ - **Can we predict the effects of an intervention?** - We all go to the arcade `\(\Rightarrow\)` more CS doctorates? - Classic example: supply and demand (e.g., Wright, 1928) - If we increase marketing spending, will sales go up and by how much?] -- .pull-right[ - **Can we impute counterfactuals?** - A customer was not targeted by a social media campaign and did not buy the product (observed). - Would that customer have bought the product if they had been targeted (unobserved)? ] --- class: logo-small hide-footer ## Causal Inference: two approaches <font size="3">(see Imbens, 2020)</font> ___ .pull-left[ .dense[ - **Directed Acyclic Graphs (DAGs)** - Concerned with identification of causal relationships - Shows direction of causality and important variables - Graphical representation: <!-- --> ] ] -- .pull-right[ .dense[ - **Potential Outcome** - Multiple Treatments / Causes <br> <font size="3">e.g., exposure to ad</font> - Potential outcomes for 
treatments <br> <font size="3">e.g., Purchase given exposure / no exposure</font> - Multiple observations with different treatments <br> <font size="3">e.g., A/B test</font> - Focus on assignment of treatment <br> <font size="3">e.g., randomized experiment, selection on (un)observables</font> ] ] --- # DAGs ___ .pull-left[ - Our focus:<br> **Identification of causal relationships** - Identification of data requirements - Improvement of Models - Interpretation of Models - Correct use for business decision making ] -- .pull-right[ - Not our focus: Types of relationships - Functional form - Sign of relationship - Take (micro)econometrics/ML classes! - Read [Statistical Rethinking](https://xcelab.net/rm/statistical-rethinking/) ] --- # DAG Examples ___ <!-- --> --- # DAG Examples ___ <!-- --> --- # DAG Examples ___ <!-- --> --- class: hide-footer hide-logo-bottom # Analyzing DAGs: **d-separation** ___ .dense[ - Necessary to decide which variables to use in model - "d" stands for "directional" - Usually we are dealing with more than two variables - Complication: causation flows only in the direction of the arrows, but association can also flow against them ] -- <!-- --> --- class: hide-footer # Analyzing DAGs: Fork ___ .pull-left[ <!-- --> ] .pull-right[ .dense[ - d causes both x and y1 - Paths that start with an arrow pointing into x are called "back-door" paths - Eliminated by randomized experiment! Why? - Controlling for d "blocks" the non-causal association x `\(\rightarrow\)` y1 ] ] --- # Analyzing DAGs: Pipe ___ <!-- --> .dense[ - x causes y2 through z - Controlling for z blocks the causal association x `\(\rightarrow\)` y2 ] --- class: hide-footer # Analyzing DAGs: Colliders ___ <!-- --> .dense[ - x and y3 cause a (but not each other) - Controlling for a opens the non-causal association x `\(\rightarrow\)` y3 ] --- # Colliders: What is happening? 
___ <!-- --> -- .dense[ - If I know the amount of hours you work and your income, I can guess your level of education - Given just the hours you work I have no idea about your education ] --- # Multiple paths ___ <!-- --> - Treat each one separately! --- # Multiple paths ___ <!-- --> --- class: logo-small # Common bad controls <font size="5">(Cinelli, Forney, and Pearl, 2020)</font> ___ <!-- --> --- # Common bad controls ___ <!-- --> --- # Take-aways for causal analysis ___ .dense[ - Variable selection should be based on DAG - Not statistical criteria (R-squared, AIC, p-values) - Models are designed for one specific causal effect - No causal interpretation for confounding variables! - Each causal question needs a model! - see Westreich and Greenland (2013) - Think about things you do not observe - This is essential for *correct* decision making in a business setting - The fanciest methodology will not tell you how to model! - Finding association does not mean we can predict the effect of intervention ] --- # Revisiting: bad Marketing ___ <!-- --> --- # References ___ .scrollable[ ### Papers & Books Cinelli, C., A. Forney, and J. Pearl (2020). "A crash course in good and bad controls". In: _SSRN 3689437_. Cunningham, S. (2021). _Causal inference - The Mixtape_. Yale University Press. URL: [https://mixtape.scunning.com/index.html](https://mixtape.scunning.com/index.html). Imbens, G. W. (2020). "Potential Outcome and Directed Acyclic Graph Approaches to Causality: Relevance for Empirical Practice in Economics". In: _Journal of Economic Literature_ 58.4, pp. 1129-79. DOI: [10.1257/jel.20191597](https://doi.org/10.1257%2Fjel.20191597). URL: [https://www.aeaweb.org/articles?id=10.1257/jel.20191597](https://www.aeaweb.org/articles?id=10.1257/jel.20191597). Locke, S. and L. D'Agostino McGowan (2018). _datasauRus: Datasets from the Datasaurus Dozen_. R package version 0.1.4. URL: [https://CRAN.R-project.org/package=datasauRus](https://CRAN.R-project.org/package=datasauRus). 
McElreath, R. (2020). _Statistical rethinking: A Bayesian course with examples in R and Stan_. Chapman and Hall/CRC. Morgan, S. L. and C. Winship (2015). _Counterfactuals and causal inference - Methods and Principles for Social Research_. 2nd Edition. Cambridge University Press. Pearl, J. (2009). "Causal inference in statistics: An overview". In: _Statistics Surveys_ 3, pp. 96-146. Pearl, J. (2000). _Causality: Models, Reasoning, and Inference_. Cambridge, UK: Cambridge University Press. Westreich, D. and S. Greenland (2013). "The table 2 fallacy: presenting and interpreting confounder and modifier coefficients". In: _American Journal of Epidemiology_ 177.4, pp. 292-298. Wright, P. G. (1928). _Tariff on animal and vegetable oils_. Macmillan Company, New York. ### Links [Data Science and the Art of Persuasion](https://hbr.org/2019/01/data-science-and-the-art-of-persuasion) [Descriptive, Predictive, Prescriptive, and Diagnostic Analytics: A Quick Guide](https://www.sigmacomputing.com/blog/descriptive-predictive-prescriptive-and-diagnostic-analytics-a-quick-guide/) [Causal Salad (link to lecture at the bottom)](https://github.com/rmcelreath/causal_salad_2021) [Milton Friedman's Thermostat](https://themonkeycage.org/2012/07/milton-friedmans-thermostat/) [Correlation does not even imply correlation](https://statmodeling.stat.columbia.edu/2014/08/04/correlation-even-imply-correlation/) [d-SEPARATION WITHOUT TEARS](http://bayes.cs.ucla.edu/BOOK-2K/d-sep.html) ]